Feat/90 qdrant summary and raw engine #99

Merged
merged 10 commits into from
Nov 14, 2024

Conversation

amindadgar
Member

@amindadgar amindadgar commented Nov 14, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new enumeration for data types: INTEGER, STRING, BOOLEAN, FLOAT.
    • Added a new DualQdrantRetrievalEngine class for enhanced data retrieval capabilities.
    • Launched TelegramDualQueryEngine to improve query processing for Telegram users.
  • Improvements

    • Enhanced Telegram query engine setup with conditional handling for summary collections.
    • Streamlined the BaseQdrantEngine class for better performance and reduced dependencies.
  • Utilities

    • New utility class QdrantEngineUtils for managing data filters and combining query results.
  • Bug Fixes

    • Removed outdated unit tests for the BaseQdrantEngine, which may impact verification of related functionalities.

@amindadgar amindadgar linked an issue Nov 14, 2024 that may be closed by this pull request
Contributor

coderabbitai bot commented Nov 14, 2024

Walkthrough

This pull request introduces several enhancements across multiple files, primarily focusing on the implementation of new classes and modifications to existing query engines. A new enumeration DataType is added to define various data types. The query_multiple_source function is updated to incorporate a dual query engine for Telegram. Additionally, new utility classes and methods are introduced to facilitate data retrieval and processing, particularly for the Qdrant vector store. Overall, these changes aim to improve the flexibility and functionality of the query handling system.

Changes

File Change Summary
schema/type.py Added enumeration class DataType with members: INTEGER, STRING, BOOLEAN, FLOAT.
subquery.py Modified query_multiple_source to check for telegram_summary. Added import for TelegramDualQueryEngine.
utils/query_engine/__init__.py Added import for DualQdrantRetrievalEngine. Updated imports for TelegramDualQueryEngine and TelegramQueryEngine.
utils/query_engine/base_qdrant_engine.py Updated imports, removed collection_name initialization, and modified prepare method to return BaseQueryEngine.
utils/query_engine/dual_qdrant_retrieval_engine.py Added DualQdrantRetrievalEngine class with methods for query execution and engine setup.
utils/query_engine/qdrant_query_engine_utils.py Added QdrantEngineUtils class with methods for filtering and combining nodes.
utils/query_engine/telegram.py Added TelegramDualQueryEngine class with a prepare method for query processing.
tests/unit/test_base_qdrant_engine.py Removed TestBaseQdrantEngine class, eliminating associated unit tests for BaseQdrantEngine.

Poem

In the burrows deep and wide,
New engines hop, a joyful ride!
Data types now dance and play,
Telegram queries find their way.
With dual paths, they swiftly glide,
In code's embrace, we take great pride! 🐇✨


The base Qdrant engine now uses the structure of our dual Qdrant engine. In the future we'll be removing the BaseQdrantEngine.
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

🧹 Outside diff range and nitpick comments (10)
schema/type.py (1)

4-8: Consider enhancing the enum with docstring, type hints, and additional data types.

The enum could benefit from:

  1. Class-level documentation explaining its purpose and usage
  2. Additional common data types that might be needed for metadata (e.g., DATETIME, ARRAY, OBJECT)
  3. Type hints for better static analysis

Consider applying these enhancements:

 from enum import Enum
+from typing import Any
 
 
-class DataType(Enum):
+class DataType(str, Enum):
+    """Enumeration of supported data types for metadata fields.
+    
+    This enum is used for specifying metadata date formats and filtering in query engines.
+    Inheriting from str allows for direct string comparison without accessing .value
+    """
     INTEGER = "INTEGER"
     STRING = "STRING"
     BOOLEAN = "BOOLEAN"
     FLOAT = "FLOAT"
+    DATETIME = "DATETIME"
+    ARRAY = "ARRAY"
+    OBJECT = "OBJECT"
+
+    def validate_value(self, value: Any) -> bool:
+        """Validate if a value matches the declared type."""
+        type_validators = {
+            # bool is a subclass of int, so exclude it explicitly
+            DataType.INTEGER: lambda x: isinstance(x, int) and not isinstance(x, bool),
+            DataType.FLOAT: lambda x: isinstance(x, float),
+            DataType.STRING: lambda x: isinstance(x, str),
+            DataType.BOOLEAN: lambda x: isinstance(x, bool),
+            # only the container type is checked here; validating the datetime
+            # format itself would need a helper that this suggestion does not define
+            DataType.DATETIME: lambda x: isinstance(x, (str, int)),
+            DataType.ARRAY: lambda x: isinstance(x, (list, tuple)),
+            DataType.OBJECT: lambda x: isinstance(x, dict),
+        }
+        return type_validators[self](value)
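
The point about inheriting from str can be checked in isolation. The snippet below is a minimal sketch rather than the project's code: it redefines a two-member subset of DataType, and PlainType is a hypothetical comparison class that exists only for this illustration.

```python
from enum import Enum


class DataType(str, Enum):
    """Two-member subset of the suggested enum, with the str mixin."""

    INTEGER = "INTEGER"
    FLOAT = "FLOAT"


class PlainType(Enum):
    """Hypothetical counterpart without the str mixin, for comparison only."""

    FLOAT = "FLOAT"


# With the mixin, a member compares equal to its plain-string value.
assert DataType.FLOAT == "FLOAT"

# A member can also be recovered from a string, e.g. one read from a config.
assert DataType("INTEGER") is DataType.INTEGER

# Without the mixin, the comparison needs `.value`.
assert PlainType.FLOAT != "FLOAT"
assert PlainType.FLOAT.value == "FLOAT"
```

One trade-off worth weighing before adopting the mixin: members then behave like strings everywhere, including in dictionary lookups and JSON serialization, which may or may not be what the filtering code expects.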
subquery.py (1)

129-138: Consider adding logging for engine selection

Adding debug logging would help track which engine is being used and why, making it easier to diagnose issues in production.

+ import logging
+ 
+ logger = logging.getLogger(__name__)
+
  if telegram and check_collection("telegram"):
      # checking if the summaries was available
      if check_collection("telegram_summary"):
+         logger.debug("Using TelegramDualQueryEngine with summaries")
          telegram_query_engine = TelegramDualQueryEngine(
              community_id=community_id
          ).prepare()
      else:
+         logger.debug("Falling back to TelegramQueryEngine (no summaries available)")
          telegram_query_engine = TelegramQueryEngine(
              community_id=community_id
          ).prepare()
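
The selection logic in the diff above can be exercised on its own. What follows is a rough sketch under stated assumptions: `choose_telegram_engine` is a hypothetical helper that does not exist in the PR, `check_collection` is passed in as a plain callable, and the function returns the engine's class name as a string instead of building a real engine.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def choose_telegram_engine(check_collection: Callable[[str], bool]) -> Optional[str]:
    """Return the name of the engine class to use, or None if there is no Telegram data.

    `check_collection` stands in for the project's collection lookup; here it is
    any callable that maps a collection name to True or False.
    """
    if not check_collection("telegram"):
        logger.debug("No telegram collection found; skipping the Telegram engine")
        return None
    if check_collection("telegram_summary"):
        logger.debug("Using TelegramDualQueryEngine with summaries")
        return "TelegramDualQueryEngine"
    logger.debug("Falling back to TelegramQueryEngine (no summaries available)")
    return "TelegramQueryEngine"


# Usage with an in-memory set standing in for the vector store's collections.
available = {"telegram", "telegram_summary"}
engine_name = choose_telegram_engine(lambda name: name in available)
```

Pulling the branch into a small function like this also makes the fallback path testable without a running Qdrant instance.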
utils/query_engine/base_qdrant_engine.py (2)

25-33: Consider adding error handling for setup_engine

The DualQdrantRetrievalEngine.setup_engine method may raise exceptions if initialization fails. Consider adding error handling to manage potential exceptions and ensure robustness.


25-33: Consider adding a docstring to the prepare method

Including a docstring for the prepare method would enhance code readability and maintainability by providing clear documentation of its purpose and usage.
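
Both suggestions for prepare could look roughly like the sketch below. It is not the PR's implementation: `EngineSetupError` is a hypothetical exception type, `setup_engine` is injected as a callable standing in for `DualQdrantRetrievalEngine.setup_engine`, and the `platform_name` and `community_id` keyword names are assumed from the collection-name format mentioned elsewhere in this review.

```python
class EngineSetupError(RuntimeError):
    """Hypothetical error raised when the query engine cannot be built."""


def prepare(setup_engine, platform_name: str, community_id: str):
    """
    Build the query engine for one platform of one community.

    Parameters
    ----------
    setup_engine : callable
        stand-in for `DualQdrantRetrievalEngine.setup_engine` (assumed signature)
    platform_name : str
        the platform whose collection should be queried
    community_id : str
        the community the collection belongs to

    Returns
    -------
    engine
        whatever `setup_engine` returns

    Raises
    ------
    EngineSetupError
        if `setup_engine` fails for any reason; the original error is chained
    """
    try:
        return setup_engine(platform_name=platform_name, community_id=community_id)
    except Exception as exc:
        raise EngineSetupError(
            f"Could not set up the engine for {community_id}_{platform_name}"
        ) from exc
```

Chaining with `from exc` keeps the original traceback attached, so the more descriptive message does not hide the root cause.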

utils/query_engine/qdrant_query_engine_utils.py (2)

90-91: Correct the docstring to reference the proper metadata key

In the docstring for raw_nodes, self.me seems incorrect. It should refer to self.metadata_date_key to accurately describe the metadata field used.

Apply this diff to correct the docstring:

    raw_nodes : list[NodeWithScore]
-       list of raw nodes containing metadata with self.me as float timestamp
+       list of raw nodes containing metadata with `self.metadata_date_key` as a float timestamp
        and 'text' field

116-120: Preserve summary order when removing duplicates

Using a set to remove duplicates from summaries may alter the original order, which could be important for readability or contextual meaning. To preserve order while removing duplicates, consider using an ordered approach.

Apply this diff to maintain the order:

- summaries = summary_text.split("\n")
- summary_bullets = set(summaries)
+ summary_bullets = []
+ for line in summary_text.split("\n"):
+     if line and line not in summary_bullets:
+         summary_bullets.append(line)
  if "" in summary_bullets:
      summary_bullets.remove("")
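
An alternative to the loop in the diff, shown here only as a sketch: since Python 3.7 a dict keeps insertion order, so `dict.fromkeys` can remove duplicates while preserving the first-seen order, without the repeated list scans. `unique_lines` is a hypothetical helper name, not something from the PR.

```python
def unique_lines(summary_text: str) -> list[str]:
    """Split text into lines, dropping blank lines and duplicates.

    The first occurrence of each line determines its position in the result.
    """
    return list(dict.fromkeys(line for line in summary_text.split("\n") if line))


summary_text = "- topic B\n- topic A\n\n- topic B\n- topic C"
print(unique_lines(summary_text))
# prints ['- topic B', '- topic A', '- topic C']
```

Because blank lines are filtered before deduplication, the separate check for an empty string is no longer needed with this approach.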
utils/query_engine/dual_qdrant_retrieval_engine.py (4)

53-53: Avoid shadowing built-in function filter

The variable name filter shadows the built-in Python function filter(). This can lead to unexpected behavior and reduced code readability. Consider renaming the variable to something like qdrant_filter or filters.

Apply this diff to rename the variable:

-            filter = utils.define_raw_data_filters(dates=dates)
+            qdrant_filter = utils.define_raw_data_filters(dates=dates)

And update its usage accordingly.
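
To illustrate why the shadowing matters, here is a small self-contained sketch. Both functions and the dictionary payload are hypothetical and unrelated to the actual Qdrant filter objects; they only show what happens to the builtin inside a scope that reuses its name.

```python
def build_with_shadowing(values):
    # Assigning to `filter` makes it a local name, hiding the builtin in this scope.
    filter = {"must": []}
    try:
        return list(filter(None, values))  # this calls the dict, not the builtin
    except TypeError as exc:
        return f"TypeError: {exc}"


def build_with_rename(values):
    qdrant_filter = {"must": []}  # hypothetical filter payload
    kept = list(filter(None, values))  # the builtin is still reachable
    return qdrant_filter, kept
```

In the first function any later call to `filter(...)` fails with a TypeError, while the renamed version keeps the builtin usable.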


15-24: Encapsulate qa_prompt as an instance variable

The qa_prompt is defined at the module level but used within the class methods. For better encapsulation and flexibility, consider setting qa_prompt as an instance variable initialized during object creation.

Apply this diff to move qa_prompt into the class:

-qa_prompt = PromptTemplate(
+class DualQdrantRetrievalEngine(CustomQueryEngine):
+
+    def __init__(self, ..., qa_prompt=None, ...):
+        if qa_prompt is None:
+            qa_prompt = PromptTemplate(
                 "Context information is below.\n"
                 "---------------------\n"
                 "{context_str}\n"
                 "---------------------\n"
                 "Given the context information and not prior knowledge, "
                 "answer the query.\n"
                 "Query: {query_str}\n"
                 "Answer: "
             )
+        self.qa_prompt = qa_prompt

This change allows each instance to have its own prompt template.
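
A minimal sketch of the per-instance default, with loud assumptions: `PromptHolder` is a hypothetical stand-in for the engine class, and a plain format string replaces llama-index's `PromptTemplate` so the example runs without dependencies. As far as I know `CustomQueryEngine` is a pydantic model, so in the real class the prompt would more likely be declared as a field with a default than assigned in `__init__`.

```python
from typing import Optional

DEFAULT_QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


class PromptHolder:
    """Hypothetical stand-in for the engine; only the prompt handling is shown."""

    def __init__(self, qa_template: Optional[str] = None) -> None:
        # Fall back to the shared default when the caller does not override it.
        self.qa_template = qa_template if qa_template is not None else DEFAULT_QA_TEMPLATE

    def render(self, context_str: str, query_str: str) -> str:
        return self.qa_template.format(context_str=context_str, query_str=query_str)


default_engine = PromptHolder()
terse_engine = PromptHolder("{context_str}\nQ: {query_str}\nA: ")
```

Keeping the default at module level and overriding per instance leaves existing callers unchanged while allowing a different prompt per platform.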


95-96: Ensure consistent collection naming conventions

In both setup_engine and setup_engine_with_summaries, the collection_name is constructed using f"{community_id}_{platform_name}". To prevent unintended behavior due to inconsistent naming, make sure community_id and platform_name are consistently formatted and validated.

Also applies to: 154-155
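
One way to centralize the naming is a single helper used by both setup methods. The sketch below is an assumption-heavy illustration: `build_collection_name` is a hypothetical function, and the rules it enforces (strip whitespace, lowercase the platform name, allow only ASCII letters and digits) are guesses at a reasonable policy, not requirements stated in the PR.

```python
import re

# Assumed policy: each part is a non-empty run of ASCII letters and digits.
_NAME_PART = re.compile(r"[A-Za-z0-9]+")


def build_collection_name(community_id: str, platform_name: str) -> str:
    """Return the `{community_id}_{platform_name}` collection name after validation."""
    community_id = community_id.strip()
    platform_name = platform_name.strip().lower()
    for label, value in (("community_id", community_id), ("platform_name", platform_name)):
        if not _NAME_PART.fullmatch(value):
            raise ValueError(f"{label} must be non-empty and alphanumeric, got {value!r}")
    return f"{community_id}_{platform_name}"
```

With one helper, `setup_engine` and `setup_engine_with_summaries` cannot drift apart in how they format the name.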


201-203: Add error handling when loading the index

When loading the vector store index, there's a possibility of encountering exceptions (e.g., if the collection doesn't exist). Adding error handling will make the code more robust and provide clearer error messages.

Apply this diff to include error handling:

     qdrant_vector = QDrantVectorAccess(collection_name=collection_name)
-    index = qdrant_vector.load_index()
+    try:
+        index = qdrant_vector.load_index()
+    except Exception as e:
+        raise RuntimeError(f"Failed to load index for collection '{collection_name}': {e}") from e
     return index

This catches exceptions during index loading and raises a descriptive error.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c5056be and 42e8a97.

📒 Files selected for processing (7)
  • schema/type.py (1 hunks)
  • subquery.py (2 hunks)
  • utils/query_engine/__init__.py (1 hunks)
  • utils/query_engine/base_qdrant_engine.py (2 hunks)
  • utils/query_engine/dual_qdrant_retrieval_engine.py (1 hunks)
  • utils/query_engine/qdrant_query_engine_utils.py (1 hunks)
  • utils/query_engine/telegram.py (1 hunks)
🔇 Additional comments (7)
schema/type.py (2)

1-8: LGTM! The basic enum implementation is clean and follows Python conventions.

The implementation provides a good foundation for type categorization.


4-8: Verify the enum usage in query engines.

Let's verify how this enum is being used in the query engines to ensure consistent type handling.

✅ Verification successful

Based on the shell script results, I can now generate the final response:

DataType enum is consistently used for date format handling in query engines

The enum is properly utilized in the query engine implementations:

  • In QdrantEngineUtils, it correctly validates and handles INTEGER and FLOAT date formats
  • TelegramQueryEngine uses FLOAT for raw data and STRING for summaries
  • The implementation includes proper validation, throwing an error for unsupported date formats
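
The date-format handling described in these bullets might look roughly like the following. This is a sketch, not the code in QdrantEngineUtils: `to_filter_value` is a hypothetical helper, the enum is redefined locally so the example is self-contained, and since this thread does not show how STRING dates for summaries are handled, the sketch simply rejects them.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Union


class DataType(Enum):
    INTEGER = "INTEGER"
    STRING = "STRING"
    BOOLEAN = "BOOLEAN"
    FLOAT = "FLOAT"


def to_filter_value(date: datetime, date_format: DataType) -> Union[int, float]:
    """Convert a datetime into the numeric timestamp a range filter would compare against."""
    if date_format is DataType.INTEGER:
        return int(date.timestamp())
    if date_format is DataType.FLOAT:
        return date.timestamp()
    # Anything else is treated as unsupported for numeric range filtering.
    raise ValueError(f"Unsupported date format: {date_format}")


day = datetime(2024, 11, 14, tzinfo=timezone.utc)
as_int = to_filter_value(day, DataType.INTEGER)
as_float = to_filter_value(day, DataType.FLOAT)
```

Using a timezone-aware datetime avoids the result depending on the machine's local timezone.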
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check how DataType enum is used in query engines
# Expected results: Find usage in setup_engine_with_summaries and define_raw_data_filters

# Search for DataType usage in query engine files
echo "Searching for DataType usage in query engines..."
rg -A 5 "DataType" "utils/query_engine/"

# Search for metadata date format handling
echo "Searching for metadata date format handling..."
rg -A 5 "metadata_date_format" "utils/query_engine/"

Length of output: 8343

utils/query_engine/__init__.py (2)

9-9: Verify Telegram engine changes and backward compatibility.

The split into TelegramDualQueryEngine and TelegramQueryEngine suggests a significant enhancement. Please ensure:

  1. Existing functionality remains intact
  2. Integration with the new DualQdrantRetrievalEngine is properly implemented

Let's verify these aspects:

#!/bin/bash
# Description: Verify Telegram engine implementations and their integration
# Expected: Find both class definitions and their relationships

# Check the relationship between Telegram engines and DualQdrantRetrievalEngine
ast-grep --pattern 'class TelegramDualQueryEngine {
  $$$
  prepare($_) {
    $$$
  }
  $$$
}'

# Look for potential breaking changes in the original TelegramQueryEngine
ast-grep --pattern 'class TelegramQueryEngine {
  $$$
  prepare($_) {
    $$$
  }
  $$$
}'

# Check for any references to these engines in test files
rg "TelegramDualQueryEngine|TelegramQueryEngine" -t python "tests/"

2-2: LGTM! Verify the DualQdrantRetrievalEngine implementation.

The import aligns with the PR objectives for implementing Qdrant summary and raw engine functionality.

Let's verify the implementation of this new engine:

✅ Verification successful

Import of DualQdrantRetrievalEngine is properly implemented

The implementation is complete and well-structured with:

  • Core query functionality through custom_query method
  • Setup methods for both regular and summary-based retrieval
  • Proper integration with Qdrant through QDrantVectorAccess
  • Essential dependencies for vector store operations and LLM integration
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the DualQdrantRetrievalEngine implementation
# Expected: Find the class definition and its core methods

# Look for the class definition and its structure
ast-grep --pattern 'class DualQdrantRetrievalEngine {
  $$$
  setup_engine_with_summaries($_) {
    $$$
  }
  $$$
}'

# Check for any TODO or FIXME comments that might indicate incomplete implementation
rg -i "TODO|FIXME|XXX|HACK|BUG" "dual_qdrant_retrieval_engine.py"

Length of output: 288


Script:

#!/bin/bash
# Let's try a different approach to verify the implementation

# First, let's find the actual file
fd "dual_qdrant.*\.py$"

# Then let's examine its content with Python-specific pattern
rg "class DualQdrantRetrievalEngine" -A 5

# Look for any method definitions within this class
rg "^[[:space:]]*def" --type py dual_qdrant

# Check imports to understand dependencies
rg "^from|^import" --type py dual_qdrant

Length of output: 794


Script:

#!/bin/bash
# Let's examine the complete implementation using the full path
rg "^[[:space:]]*def" utils/query_engine/dual_qdrant_retrieval_engine.py

# Check imports
rg "^from|^import" utils/query_engine/dual_qdrant_retrieval_engine.py

# Look for any usage of Qdrant client or configuration
rg "Qdrant|qdrant" utils/query_engine/dual_qdrant_retrieval_engine.py

Length of output: 1546

utils/query_engine/telegram.py (1)

1-6: LGTM! Imports are well-organized and necessary.

All imports are properly utilized within the implementation.

subquery.py (1)

16-16: LGTM: Import statement properly added

The new import for TelegramDualQueryEngine is correctly placed within the existing query engine imports.

utils/query_engine/base_qdrant_engine.py (1)

1-4: Imports are appropriate and necessary

The newly added imports are required for the updated implementation and are correctly specified.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
utils/query_engine/dual_qdrant_retrieval_engine.py (2)

14-23: Consider moving the prompt template into the class

The qa_prompt template should be moved into the class as a class constant to improve encapsulation and make it easier to modify or extend in subclasses.

 class DualQdrantRetrievalEngine(CustomQueryEngine):
+    DEFAULT_PROMPT = PromptTemplate(
+        "Context information is below.\n"
+        "---------------------\n"
+        "{context_str}\n"
+        "---------------------\n"
+        "Given the context information and not prior knowledge, "
+        "answer the query.\n"
+        "Query: {query_str}\n"
+        "Answer: "
+    )

132-135: Fix docstring parameter mismatch

The docstring mentions a use_summary parameter that doesn't exist in the method signature.

-        use_summary : bool
-            whether to use the summary data or not
-            note: the summary data should be available before
-            for this option to be enabled
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 42e8a97 and f305345.

📒 Files selected for processing (4)
  • tests/unit/test_base_qdrant_engine.py (0 hunks)
  • utils/query_engine/dual_qdrant_retrieval_engine.py (1 hunks)
  • utils/query_engine/qdrant_query_engine_utils.py (1 hunks)
  • utils/query_engine/telegram.py (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/unit/test_base_qdrant_engine.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • utils/query_engine/qdrant_query_engine_utils.py
  • utils/query_engine/telegram.py

codeRabbitAI suggestions.
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
utils/query_engine/dual_qdrant_retrieval_engine.py (3)

14-23: Add type hints to the prompt template

Consider adding type hints to improve IDE support and documentation:

-qa_prompt = PromptTemplate(
+qa_prompt: PromptTemplate = PromptTemplate(

104-107: Remove or update outdated parameter documentation

The use_summary parameter is documented but not present in the method signature. Either remove the documentation or update the method signature if the parameter is needed.


163-170: Add return type documentation

Add the return type information to the docstring:

        """
        prepare the vector_store for querying data

        Parameters
        ------------
        collection_name : str
            to override the default collection_name
+            
+        Returns
+        -------
+        VectorStoreIndex
+            The loaded vector store index
        """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f305345 and d924922.

📒 Files selected for processing (2)
  • utils/query_engine/dual_qdrant_retrieval_engine.py (1 hunks)
  • utils/query_engine/qdrant_query_engine_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • utils/query_engine/qdrant_query_engine_utils.py
🔇 Additional comments (1)
utils/query_engine/dual_qdrant_retrieval_engine.py (1)

26-39: LGTM! Clean implementation with good separation of concerns.

The class structure is well-organized with clear delegation to specialized methods for different query types.

@amindadgar amindadgar merged commit b444891 into main Nov 14, 2024
14 checks passed

Successfully merging this pull request may close these issues.

feat: Qdrant summary and raw data query engine!